Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78#79

Draft
conradbzura wants to merge 20 commits into main from 78-binned-equijoin-intersects

conradbzura commented Mar 31, 2026

Summary

Add IntersectsBinnedJoinTransformer to rewrite column-to-column INTERSECTS joins into binned equi-joins using UNNEST(range(...)) CTEs. The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL — no runtime extensions required. Three rewrite strategies are selected automatically: a full-CTE path for non-wildcard SELECTs, a key-only bridge CTE pattern for wildcard SELECTs, and a pairs-CTE path for outer joins. SELECT DISTINCT is added unconditionally to deduplicate rows from multi-bin matches.

Outer joins (LEFT, RIGHT, FULL) use a dedicated pairs-CTE approach that computes matching key pairs via an INNER binned join, then outer-joins the original tables through this pairs CTE. This avoids the bin fan-out that creates spurious NULL rows when an interval spans multiple bins but only matches in some of them — matching how Databricks and Snowflake handle the problem (they restrict binning to INNER joins entirely).
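
The fan-out problem can be reproduced without a SQL engine. The sketch below is a plain-Python model, not the SQL the transformer generates: intervals are hypothetical `(chrom, start, end)` tuples, the bin size of 100 is chosen only to keep the numbers small, and `per_bin_left_join` and `pairs_left_join` are illustrative names that do not exist in the code base.

```python
from collections import defaultdict

BIN = 100  # illustrative bin size, far smaller than the real default


def bins(start, end, size=BIN):
    """Bin indices touched by the half-open interval [start, end)."""
    return range(start // size, (end - 1) // size + 1)


def overlaps(a, b):
    """Same chromosome and half-open overlap."""
    return a[0] == b[0] and a[1] < b[2] and a[2] > b[1]


def index_by_bin(rows):
    """Map (chrom, bin) to the rows that touch that bin."""
    index = defaultdict(list)
    for row in rows:
        for k in bins(row[1], row[2]):
            index[(row[0], k)].append(row)
    return index


def per_bin_left_join(left, right):
    """Outer-join every (row, bin) copy separately: unmatched bins emit NULLs."""
    index = index_by_bin(right)
    out = set()
    for lrow in left:
        for k in bins(lrow[1], lrow[2]):
            hits = [r for r in index.get((lrow[0], k), []) if overlaps(lrow, r)]
            out.update((lrow, r) for r in hits)
            if not hits:
                out.add((lrow, None))
    return out


def pairs_left_join(left, right):
    """Inner binned join yields distinct pairs; unmatched left rows are added once."""
    index = index_by_bin(right)
    pairs = {
        (lrow, rrow)
        for lrow in left
        for k in bins(lrow[1], lrow[2])
        for rrow in index.get((lrow[0], k), [])
        if overlaps(lrow, rrow)
    }
    matched = {lrow for lrow, _ in pairs}
    return pairs | {(lrow, None) for lrow in left if lrow not in matched}


a = ("chr1", 50, 150)   # spans bins 0 and 1
b = ("chr1", 120, 130)  # falls in bin 1 only
naive = per_bin_left_join([a], [b])   # contains (a, b) and a spurious (a, None)
fixed = pairs_left_join([a], [b])     # contains only (a, b)
```

The input mirrors the scenario described for OM-001: the per-bin version emits a NULL row for bin 0 even though the same source row matches through bin 1, while the pairs version decides matched versus unmatched once per source row.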

Bin indices use integer floor division (//) instead of float division + CAST to avoid rounding errors at bin boundaries.

Closes #78

Proposed changes

Binned equi-join transformer

Add IntersectsBinnedJoinTransformer in src/giql/transformer.py. Each interval is assigned to bins via UNNEST(range(start // B, (end - 1) // B + 1)). The transformer handles explicit JOIN ON, implicit cross-join (FROM a, b WHERE ...), self-joins, multi-table joins, and custom column mappings. Bin size defaults to DEFAULT_BIN_SIZE (10,000) and is configurable via the bin_size parameter on transpile().
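
As a rough illustration of the bin assignment, the following Python sketch evaluates the same `range(start // B, (end - 1) // B + 1)` expression and uses it for an in-memory equi-join on `(chrom, bin)`. The tuple layout `(chrom, start, end)` and the helper names `bin_indices`, `binned_join`, and `naive_join` are assumptions made for this example only; the transformer itself emits SQL.

```python
from collections import defaultdict

DEFAULT_BIN_SIZE = 10_000  # default stated in this PR


def bin_indices(start, end, bin_size=DEFAULT_BIN_SIZE):
    """Python equivalent of range(start // B, (end - 1) // B + 1)."""
    return list(range(start // bin_size, (end - 1) // bin_size + 1))


def binned_join(left, right, bin_size=DEFAULT_BIN_SIZE):
    """Equi-join on (chrom, bin), filter by true overlap, deduplicate pairs."""
    index = defaultdict(list)
    for rrow in right:
        for k in bin_indices(rrow[1], rrow[2], bin_size):
            index[(rrow[0], k)].append(rrow)
    # The set plays the role of SELECT DISTINCT: a pair that shares several
    # bins is found several times but kept once.
    return {
        (lrow, rrow)
        for lrow in left
        for k in bin_indices(lrow[1], lrow[2], bin_size)
        for rrow in index.get((lrow[0], k), [])
        if lrow[1] < rrow[2] and lrow[2] > rrow[1]
    }


def naive_join(left, right):
    """Reference result: compare every pair with the overlap predicate."""
    return {
        (lrow, rrow)
        for lrow in left
        for rrow in right
        if lrow[0] == rrow[0] and lrow[1] < rrow[2] and lrow[2] > rrow[1]
    }


left = [("chr1", 9_500, 20_500)]
right = [
    ("chr1", 10_000, 12_000),  # overlaps, shares bin 1
    ("chr1", 9_000, 20_400),   # overlaps, shares bins 0, 1 and 2
    ("chr1", 20_500, 21_000),  # adjacent only (half-open), no match
    ("chr2", 9_600, 9_700),    # different chromosome, no match
]
result = binned_join(left, right)
```

Because the end coordinate is exclusive, `(end - 1) // B` is the last bin the interval actually touches, so an interval ending exactly on a bin boundary is not assigned to the following bin.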

Three rewrite strategies

  • Full-CTE path — replace table references with SELECT *, __giql_bin CTEs and rewrite the JOIN ON. Used when the SELECT list has no wildcards and no outer joins are present.
  • Bridge path — create key-only SELECT chrom, start, end, __giql_bin CTEs and a three-join chain that keeps original table references intact, preventing __giql_bin from appearing in a.* expansion. Used for wildcard SELECTs with INNER joins.
  • Pairs-CTE path — compute matching (left_key, right_key) pairs via an INNER binned join with DISTINCT, then outer-join the original tables through the pairs CTE. Used for any query containing outer joins with INTERSECTS predicates. This avoids the bin fan-out problem where multi-bin intervals create spurious NULL rows.
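
The selection rules above can be summarized in a few lines. This is a hedged sketch of the decision logic only: `Strategy`, `choose_strategy`, and the `join_sides` argument are hypothetical names, and the real transformer inspects the parsed query rather than receiving pre-computed flags.

```python
from enum import Enum
from typing import Iterable


class Strategy(Enum):
    FULL_CTE = "full-cte"
    BRIDGE = "bridge"
    PAIRS_CTE = "pairs-cte"


OUTER_SIDES = {"LEFT", "RIGHT", "FULL"}


def choose_strategy(has_wildcard: bool, join_sides: Iterable[str]) -> Strategy:
    """Pick a rewrite path from two facts about the query.

    has_wildcard: True if the SELECT list contains * or alias.*.
    join_sides: the side of each INTERSECTS join ("" for INNER).
    """
    # Outer joins take priority: they always use the pairs-CTE path.
    if any(side.upper() in OUTER_SIDES for side in join_sides):
        return Strategy.PAIRS_CTE
    # Wildcards would expose __giql_bin through a full CTE, so use the bridge.
    if has_wildcard:
        return Strategy.BRIDGE
    return Strategy.FULL_CTE


print(choose_strategy(False, [""]))       # Strategy.FULL_CTE
print(choose_strategy(True, [""]))        # Strategy.BRIDGE
print(choose_strategy(True, ["", "left"]))  # Strategy.PAIRS_CTE
```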

Integer floor division for bin indices

Replace CAST(start / B AS BIGINT) with start // B. Float division followed by CAST rounds to nearest on engines like DuckDB (e.g., CAST(621950 / 100 AS BIGINT) yields 6220 instead of 6219), causing missed matches at bin boundaries.
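
The difference between the two bin computations can be checked with the numbers quoted above. The sketch models the rounding cast with `decimal` and round-half-up; that rounding mode is an assumption used only to imitate a round-to-nearest integer cast, and `rounded_cast_bin` and `floor_bin` are illustrative names.

```python
from decimal import ROUND_HALF_UP, Decimal


def rounded_cast_bin(pos, bin_size):
    """Model of float division followed by a round-to-nearest integer cast."""
    quotient = Decimal(pos) / Decimal(bin_size)
    return int(quotient.quantize(Decimal(1), rounding=ROUND_HALF_UP))


def floor_bin(pos, bin_size):
    """Integer floor division, the behavior of the // operator."""
    return pos // bin_size


pos, bin_size = 621_950, 100
print(rounded_cast_bin(pos, bin_size))  # 6220
print(floor_bin(pos, bin_size))         # 6219
```

Position 621950 lies in the half-open range [621900, 622000), which is bin 6219 at a bin size of 100, so only the floor-division result names the bin that contains it.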

Outer join and extra ON condition handling

Propagate the join side (LEFT, RIGHT, FULL) to both replacement joins in the pairs-CTE path. Extract non-INTERSECTS siblings from AND trees in ON clauses via _extract_non_intersects() and re-attach them to the rewritten join.
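
The extraction step amounts to flattening a nested AND tree and separating the INTERSECTS leaves from everything else. The sketch below uses a tiny stand-in expression model (`Cond`, `Intersects`, `And`) rather than sqlglot nodes, so all class and function names here are assumptions; it is meant to show the shape of the operation, not the body of `_extract_non_intersects()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cond:
    """Any ordinary predicate, kept as opaque SQL text."""
    sql: str


@dataclass(frozen=True)
class Intersects:
    """A column-to-column INTERSECTS predicate."""
    left: str
    right: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


def flatten_and(node):
    """Yield the leaves of an arbitrarily nested AND tree, left to right."""
    if isinstance(node, And):
        yield from flatten_and(node.left)
        yield from flatten_and(node.right)
    else:
        yield node


def split_on_clause(node):
    """Return (intersects_leaves, other_leaves) for one ON clause."""
    leaves = list(flatten_and(node))
    intersects = [n for n in leaves if isinstance(n, Intersects)]
    extras = [n for n in leaves if not isinstance(n, Intersects)]
    return intersects, extras


def and_together(conditions):
    """Rebuild a left-deep AND chain; None when there is nothing to attach."""
    result = None
    for cond in conditions:
        result = cond if result is None else And(result, cond)
    return result


on_clause = And(
    And(Intersects("a.interval", "b.interval"), Cond("a.score > 5")),
    Cond("a.name = b.name"),
)
intersects, extras = split_on_clause(on_clause)
reattached = and_together(extras)  # AND-ed back onto the rewritten join
```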

Code quality improvements

Move DEFAULT_BIN_SIZE to constants.py and export from giql.__init__. Extract shared _build_bin_range() helper to eliminate duplicate bin-computation logic. Replace mutable-list connector counter with itertools.count. Add isinstance check for bin_size to reject floats early. Rewrite _remove_intersects_from_where to handle deeply-nested AND trees cleanly.
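
A minimal sketch of the validation and counter changes is shown below. The function names, the connector naming scheme, and the choice of `ValueError` for non-integer input are assumptions; the PR only states that zero or negative values raise `ValueError`, that floats are rejected by an `isinstance` check, and that `None` falls back to the default.

```python
import itertools

DEFAULT_BIN_SIZE = 10_000  # value stated in this PR


def resolve_bin_size(bin_size=None):
    """Return a usable bin size or raise for invalid input."""
    if bin_size is None:
        return DEFAULT_BIN_SIZE
    # bool is a subclass of int, so exclude it explicitly (an extra
    # precaution in this sketch, not something the PR describes).
    if isinstance(bin_size, bool) or not isinstance(bin_size, int):
        raise ValueError(f"bin_size must be an integer, got {bin_size!r}")
    if bin_size <= 0:
        raise ValueError(f"bin_size must be positive, got {bin_size}")
    return bin_size


def next_connector_name(counter):
    """Hypothetical naming helper driven by an itertools.count instance."""
    return f"__giql_c{next(counter)}"


counter = itertools.count()
names = [next_connector_name(counter) for _ in range(3)]
```

Compared with a one-element list that is mutated in place, `itertools.count` keeps the running index in an iterator whose only operation is `next()`, which makes the intent easier to read.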

Documentation

Document the DISTINCT deduplication behavior in docs/dialect/spatial-operators.rst under a new "Deduplication Behavior" subsection of INTERSECTS, explaining the mechanism, the edge case where genuinely identical rows are collapsed, and the mitigation of including a distinguishing column.
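
The edge case and its mitigation can be reproduced with any SQL engine. The sketch below uses Python's built-in `sqlite3` and a hand-written overlap join rather than transpiler output; the table layout and the column names `start_pos`, `end_pos`, and `name` are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (chrom TEXT, start_pos INTEGER, end_pos INTEGER, name TEXT);
    CREATE TABLE b (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
    -- Two source rows with identical coordinates but different names.
    INSERT INTO a VALUES ('chr1', 100, 200, 'x'), ('chr1', 100, 200, 'y');
    INSERT INTO b VALUES ('chr1', 150, 250);
    """
)

JOIN = """
    FROM a JOIN b
      ON a.chrom = b.chrom
     AND a.start_pos < b.end_pos
     AND a.end_pos > b.start_pos
"""

# Without a distinguishing column, DISTINCT collapses the two source rows.
collapsed = conn.execute(
    "SELECT DISTINCT a.chrom, a.start_pos, a.end_pos" + JOIN
).fetchall()

# Selecting a column that differs between the rows keeps both.
preserved = conn.execute(
    "SELECT DISTINCT a.chrom, a.start_pos, a.end_pos, a.name"
    + JOIN
    + " ORDER BY a.name"
).fetchall()

print(len(collapsed), len(preserved))  # 1 2
```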

Test cases

| # | Test Suite | Test ID | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|---|
| 1 | TestTranspileBinnedJoin | BJ-001 | A GIQL query joining two tables with INTERSECTS | Transpiling with default settings | SQL contains CTEs with UNNEST/range, equi-join ON, and DISTINCT | Basic rewrite structure |
| 2 | TestTranspileBinnedJoin | BJ-002 | A GIQL query with custom bin_size=5000 | Transpiling | Bin size 5000 appears in the generated SQL | Custom bin size |
| 3 | TestTranspileBinnedJoin | BJ-003 | Tables with custom column mappings | Transpiling | Custom column names appear in the generated SQL | Column mapping |
| 4 | TestTranspileBinnedJoin | BJ-004 | An INTERSECTS with a literal range string | Transpiling | No binned CTEs are generated | Literal passthrough |
| 5 | TestTranspileBinnedJoin | BJ-005 | A query with no JOIN | Transpiling | Query passes through unchanged | No-join passthrough |
| 6 | TestTranspileBinnedJoin | BJ-006 | A query with WHERE filter alongside INTERSECTS | Transpiling | Original WHERE conditions are preserved | WHERE preservation |
| 7 | TestTranspileBinnedJoin | BJ-007 | bin_size=None | Transpiling | Default bin size 10000 is used | Default bin size |
| 8 | TestTranspileBinnedJoin | BJ-008 | Implicit cross-join with WHERE INTERSECTS | Transpiling | Binned optimization is applied | Implicit cross-join |
| 9 | TestTranspileBinnedJoin | BJ-009 | Self-join on the same table | Transpiling | Only one shared bin CTE is created | Self-join dedup |
| 10 | TestTranspileBinnedJoin | BJ-010 | bin_size=0 or negative | Transpiling | ValueError is raised | Validation |
| 11 | TestTranspileBinnedJoin | BJ-011 | Three tables with two INTERSECTS joins | Transpiling | Both joins are rewritten with separate CTEs | Multi-join |
| 12 | TestTranspileBinnedJoin | BJ-012 | SELECT with explicit column names | Transpiling | Full-CTE path is used (no bridge) | Strategy selection |
| 13 | TestTranspileBinnedJoin | BJ-013 | SELECT with wildcards | Transpiling | Bridge path is used (key-only CTEs) | Strategy selection |
| 14 | TestBinnedJoinDataFusion | DF-001 | Overlapping intervals across two tables | Executing binned join SQL | Correct rows returned with no duplicates | Correctness |
| 15 | TestBinnedJoinDataFusion | DF-002 | Non-overlapping intervals | Executing binned join SQL | Zero rows returned | Non-overlap |
| 16 | TestBinnedJoinDataFusion | DF-003 | Adjacent intervals (half-open coordinates) | Executing binned join SQL | Zero rows returned | Half-open semantics |
| 17 | TestBinnedJoinDataFusion | DF-004 | Intervals on different chromosomes | Executing binned join SQL | Only same-chromosome matches returned | Chromosome filter |
| 18 | TestBinnedJoinDataFusion | DF-005 | Intervals spanning multiple bins | Executing binned join SQL | Correct results with no duplicate rows | Multi-bin dedup |
| 19 | TestBinnedJoinDataFusion | DF-006 | Binned join vs naive cross-join | Executing both | Results are identical | Equivalence |
| 20 | TestBinnedJoinDataFusion | DF-007 | Implicit cross-join syntax | Executing binned join SQL | Correct rows, no __giql_bin in output | Column leak prevention |
| 21 | TestBinnedJoinOuterJoinSemantics | OJ-001 | LEFT JOIN with unmatched left rows (full-CTE) | Executing | Unmatched left rows appear with NULL right columns | LEFT JOIN full-CTE |
| 22 | TestBinnedJoinOuterJoinSemantics | OJ-002 | LEFT JOIN with unmatched left rows (bridge) | Executing | Unmatched left rows appear with NULL right columns | LEFT JOIN bridge |
| 23 | TestBinnedJoinOuterJoinSemantics | OJ-003 | RIGHT JOIN with unmatched right rows (full-CTE) | Executing | Unmatched right rows appear with NULL left columns | RIGHT JOIN full-CTE |
| 24 | TestBinnedJoinOuterJoinSemantics | OJ-004 | RIGHT JOIN with unmatched right rows (bridge) | Executing | Unmatched right rows appear with NULL left columns | RIGHT JOIN bridge |
| 25 | TestBinnedJoinOuterJoinSemantics | OJ-005 | FULL OUTER JOIN (full-CTE) | Executing | Both unmatched sides appear with NULLs | FULL OUTER full-CTE |
| 26 | TestBinnedJoinOuterJoinSemantics | OJ-006 | FULL OUTER JOIN (bridge fallback) | Executing | Both unmatched sides appear with NULLs | FULL OUTER bridge fallback |
| 27 | TestBinnedJoinOuterJoinSemantics | OJ-007 | LEFT JOIN where no rows match | Executing | All left rows returned with NULL right columns | All-unmatched LEFT |
| 28 | TestBinnedJoinAdditionalOnConditions | AC-001 | Extra equality in ON alongside INTERSECTS (full-CTE) | Executing | Extra condition filters results correctly | Extra ON full-CTE |
| 29 | TestBinnedJoinAdditionalOnConditions | AC-002 | Extra equality in ON alongside INTERSECTS (bridge) | Executing | Extra condition filters results correctly | Extra ON bridge |
| 30 | TestBinnedJoinAdditionalOnConditions | AC-003 | Extra ON condition with LEFT JOIN | Executing | Unmatched rows preserved, extra filter applied | Extra ON + LEFT |
| 31 | TestBinnedJoinAdditionalOnConditions | AC-004 | Multiple extra conditions in ON | Executing | All extra conditions preserved and applied | Multiple ON conditions |
| 32 | TestBinnedJoinAdditionalOnConditions | AC-005 | Extra WHERE condition with implicit cross-join | Executing | WHERE condition applied after INTERSECTS rewrite | Extra WHERE cross-join |
| 33 | TestBinnedJoinDistinctSemantics | DS-001 | Duplicate source rows with no unique column (full-CTE) | Executing | Duplicates collapsed (xfail, known limitation) | DISTINCT limitation |
| 34 | TestBinnedJoinDistinctSemantics | DS-002 | Duplicate source rows with no unique column (bridge) | Executing | Duplicates collapsed (xfail, known limitation) | DISTINCT limitation |
| 35 | TestBinnedJoinDistinctSemantics | DS-003 | Rows with distinguishing column | Executing | All distinct rows preserved | DISTINCT with unique col |
| 36 | TestBinnedJoinDistinctSemantics | DS-004 | User-specified DISTINCT already in query | Executing | Still works correctly | Idempotent DISTINCT |
| 37 | TestBinnedJoinBinBoundaryRounding | BR-001 | Interval B at a .5 division boundary (621950/100) | Executing on DuckDB with bin_size=100 | Overlap is found, not missed by rounding | Float division rounding |
| 38 | TestBinnedJoinBinBoundaryRounding | BR-002 | Interval B starting at exact multiple of bin_size | Executing on DuckDB | Correct bin assigned, no off-by-one | Exact boundary |
| 39 | TestBinnedJoinOuterJoinMultiBin | OM-001 | LEFT JOIN with A spanning bins 0-1, B only in bin 1 | Executing | 1 matched row, no spurious NULL from bin 0 | Multi-bin LEFT |
| 40 | TestBinnedJoinOuterJoinMultiBin | OM-002 | LEFT JOIN with A having no overlap in B | Executing | 1 row with NULL B columns | Unmatched LEFT preserved |
| 41 | TestBinnedJoinOuterJoinMultiBin | OM-003 | RIGHT JOIN with B spanning bins 0-1, A only in bin 0 | Executing | 1 matched row, no spurious NULL from bin 1 | Multi-bin RIGHT |
| 42 | TestBinnedJoinOuterJoinMultiBin | OM-004 | FULL OUTER JOIN with A spanning bins 0-1, B in bin 1 | Executing | 1 matched row, no spurious NULL | Multi-bin FULL |
| 43 | test_intersect_property | PB-001 | Two random interval sets (up to 60 each, up to 200kb) | GIQL INTERSECTS vs bedtools intersect -u | Results match exactly (50 examples) | Randomized correctness |
| 44 | test_intersect_property | PB-002 | One random interval set | GIQL self-join vs bedtools self-intersect | Results match exactly (30 examples) | Self-join correctness |
| 45 | test_intersect_property | PB-003 | Two random interval sets, bin_size in {100, 1k, 10k, 100k} | GIQL INTERSECTS vs bedtools | Results match regardless of bin size (40 examples) | Bin-size independence |
| 46 | test_intersect_property | PB-004 | Three random interval sets (up to 8 each) | GIQL three-way JOIN vs chained bedtools intersect | A-side rows match (40 examples) | Multi-table join |
| 47 | test_intersect_property | PB-005 | Two random interval sets (up to 30 each) | GIQL LEFT JOIN vs bedtools intersect -loj | Results match exactly (40 examples) | LEFT JOIN correctness |

Column-to-column INTERSECTS joins (e.g., a.interval INTERSECTS
b.interval) are now rewritten into binned equi-joins using CTEs with
UNNEST(range(...)) bin assignments. This gives the query planner an
equi-join key to work with instead of forcing a nested-loop or cross
join. The bin size defaults to 10,000 and is configurable via the
new bin_size parameter on transpile(). Literal-range INTERSECTS
filters remain unchanged.

The datafusion dependency is needed for end-to-end correctness tests
that validate the binned equi-join SQL against DataFusion's query engine.

The transformer now detects column-to-column INTERSECTS in WHERE
clauses (FROM a, b WHERE a.interval INTERSECTS b.interval), not
just in explicit JOIN ON conditions. Both patterns are rewritten
to binned equi-joins for the same performance benefit.

The tests cover both explicit JOIN ON and implicit cross-join patterns,
custom bin sizes, custom column mappings, self-joins, literal
range passthrough, and end-to-end correctness against DataFusion
including multi-bin deduplication and equivalence with naive joins.

Move the overlap predicate (a.start < b.end AND a.end > b.start) from WHERE
into the JOIN ON clause so that LEFT/RIGHT/FULL JOIN semantics are
preserved — a WHERE filter on the right-side columns silently converts
outer joins into inner joins.
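
The effect is easy to reproduce outside the transpiler. The sketch below uses Python's built-in `sqlite3` with two small hypothetical tables (the column names `start_pos` and `end_pos` are assumptions) and runs the same LEFT JOIN twice, once with the overlap predicate in WHERE and once in ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE b (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
    INSERT INTO a VALUES ('chr1', 100, 200), ('chr1', 5000, 6000);
    INSERT INTO b VALUES ('chr1', 150, 250);
    """
)

# Overlap predicate in WHERE: rows whose b columns are NULL or do not
# overlap fail the filter, so unmatched left rows disappear.
in_where = conn.execute(
    """
    SELECT a.start_pos, b.start_pos
    FROM a LEFT JOIN b ON a.chrom = b.chrom
    WHERE a.start_pos < b.end_pos AND a.end_pos > b.start_pos
    ORDER BY a.start_pos
    """
).fetchall()

# Overlap predicate in ON: unmatched left rows survive with NULLs.
in_on = conn.execute(
    """
    SELECT a.start_pos, b.start_pos
    FROM a LEFT JOIN b
      ON a.chrom = b.chrom
     AND a.start_pos < b.end_pos
     AND a.end_pos > b.start_pos
    ORDER BY a.start_pos
    """
).fetchall()

print(in_where)  # [(100, 150)]
print(in_on)     # [(100, 150), (5000, None)]
```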

Also refactor the transformer to rewrite all INTERSECTS joins in a
query, not just the first. A new _ensure_table_binned helper tracks
which aliases already have binned CTEs so that multi-join queries
reuse CTEs instead of duplicating them.

Add bin_size validation (must be positive) and remove dead code from
_rewrite_where.

Cover three-way joins with CTE reuse, invalid bin_size rejection,
and update assertions for the overlap-in-ON change. Remove unused
pytest import from module level.

The binned CTE approach leaks __giql_bin into SELECT * results because
CTEs expose all their columns. Revert implicit cross-join rewriting
(FROM a, b WHERE INTERSECTS) so those queries fall through to the
generator's naive overlap predicate, which produces clean column output.
Explicit JOIN ON INTERSECTS continues to use the binned equi-join.

Also add pytest.importorskip for datafusion so the DataFusion
correctness tests are skipped when the module is not installed.

The CI workflow uses pixi, not uv, so the datafusion package must
be listed under [tool.pixi.dependencies] for the DataFusion
correctness tests to run. Remove the pytest.importorskip guard
since the dependency is now always available.

The previous approach replaced FROM/JOIN table references with full
CTEs (SELECT *), causing __giql_bin to appear in SELECT a.* output.
The new approach keeps original table references and routes the equi-
join through key-only bridge CTEs (SELECT chrom, start, end, bin),
eliminating the leak entirely.

This also restores implicit cross-join rewriting (FROM a, b WHERE
INTERSECTS) which was reverted in the prior commit due to the leak.
CTEs are now named __giql_{table}_bins and deduplicated per underlying
table name rather than per alias, so self-joins share one CTE.

Queries with explicit column lists (SELECT a.chrom, b.start, ...)
cannot expose __giql_bin in their output regardless of which CTE
the table alias points to. Detecting this at transform time lets
us skip the 3-join bridge pattern entirely for those queries and
use the simpler, faster 1-join full-CTE approach.

Queries with wildcards (SELECT a.*, SELECT *) still take the bridge
path so __giql_bin never leaks into the output column set.

Drop section divider lines (`# --...--`) from `IntersectsBinnedJoinTransformer`
to reduce visual clutter. Descriptive inline comments explaining code behavior
are preserved.

Cover outer join semantics (LEFT/RIGHT/FULL preserved through both
full-CTE and bridge paths), additional ON conditions surviving the
rewrite alongside INTERSECTS, and unconditional DISTINCT collapsing
legitimate duplicate rows. The DISTINCT tests are marked xfail since
the correct behavior (preserving duplicates) is a known limitation.

7 tests fail against the current implementation, confirming the bugs.
2 tests are strict xfail documenting the DISTINCT limitation.

Two interrelated fixes for the binned equi-join rewrite:

The bridge path was silently converting LEFT/RIGHT/FULL joins to
INNER because sqlglot stores the join type as "side" not "kind",
and only join3 received it. Propagate the side attribute to both
join2 and join3. FULL OUTER with wildcards falls back to the
full-CTE path because the three-join chain's bin fan-out creates
spurious unmatched rows that DISTINCT cannot resolve.

Both rewrite paths were replacing the entire ON clause with the
binned equi-join and overlap predicate, silently dropping any
user-supplied conditions alongside INTERSECTS. Extract non-
INTERSECTS conditions from the original ON and AND them back into
the rewritten clause.

DISTINCT is added unconditionally to column-to-column INTERSECTS joins
to eliminate duplicates from the bin fan-out. This section explains the
mechanism, the edge case where it can collapse genuinely identical source
rows, and the mitigation of including any distinguishing column in the
SELECT list.

Move DEFAULT_BIN_SIZE to constants module and export from __init__.
Extract shared _build_bin_range helper to eliminate duplicate
bin-computation logic between the two CTE builders.  Replace the
mutable-list connector counter with itertools.count.  Add isinstance
check for bin_size so floats are rejected early.  Rewrite
_remove_intersects_from_where to use _extract_non_intersects so
deeply-nested AND trees are handled cleanly.  Expand docstrings on
the class, __init__, _find_column_intersects_in, and
_build_join_back_joins to document assumptions and limitations.

Use hypothesis to generate random intervals spanning multiple bins
and verify that the binned equi-join produces identical results to
bedtools intersect -u.  Three tests cover two-table joins, self-joins,
and varying bin sizes (100 to 100k).  Intervals use unique names to
avoid the known DISTINCT duplicate-collapse limitation.

Two correctness fixes for the binned equi-join rewrite:

1. Bin index rounding: CAST(start / B AS BIGINT) uses float division,
   so values like 621950/100 = 6219.5 round to 6220 instead of
   flooring to 6219.  Replace Div+Cast with IntDiv (//) which does
   proper integer floor division on all engines.

2. Outer join spurious NULLs: when an interval spans multiple bins,
   the LEFT/RIGHT/FULL outer join produces one row per bin.  Bins
   that don't match the other side create NULL rows even though the
   same source row matches via a different bin.  DISTINCT can't
   collapse these because NULL and non-NULL rows differ.  Add a
   pairs-CTE approach that computes matching (left_key, right_key)
   pairs via an INNER binned join with DISTINCT, then outer-joins
   the original tables through this pairs CTE.  This matches the
   pattern used by Databricks and Snowflake, which restrict binning
   to INNER joins and use separate logic for outer join semantics.

Add regression tests for both bugs found by property-based testing:

- TestBinnedJoinBinBoundaryRounding: verifies that overlaps at .5
  division boundaries are not dropped by float rounding (DuckDB).
- TestBinnedJoinOuterJoinMultiBin: verifies that LEFT, RIGHT, and
  FULL OUTER joins with multi-bin intervals produce no spurious NULL
  rows (DataFusion).

Add property-based bedtools correctness tests:

- test_multi_table_join_matches_bedtools: three-way INTERSECTS join
  compared against chained bedtools intersect.
- test_left_join_matches_bedtools_loj: LEFT JOIN INTERSECTS compared
  against bedtools intersect -loj.

Add -loj support to the bedtools wrapper for left outer join output.

Extend the bedtools wrapper to support -v, -wa -wb, -c, -wo, -wao,
-f, -F, and -r flags.  Add property-based tests that compare GIQL
queries against each bedtools intersect mode: inverse (-v via LEFT
JOIN anti-join), write-both (-wa -wb via full pair SELECT), count
(-c via GROUP BY COUNT), same-strand (-s), opposite-strand (-S),
minimum overlap fraction of A (-f), minimum overlap fraction of B
(-F), and reciprocal fraction (-f -r).

Total: 520 randomized examples across 13 property-based tests
covering all bedtools intersect overlap flags.